This notebook is still in PROGRESS

And what about Tennis?

# ERRATA: some variables (like W1, L1, W2, L2...) are only obtained after the match ends. That is to say, they shouldn't be used to train our model. This is what is called "Data Leakage". At the time the author wrote this notebook, this was not clear to him, which is why he kept these variables for training the models. For the sake of simplicity, the original notebook is maintained as-is, preserving not only the other analyses but also serving as an example of what happens when you don't know enough about your data! ;)

The dataset we will work with here comes from http://www.tennis-data.co.uk/alldata.php and you can find a lot of information there! It covers data on tennis matches from 2000 to 2020!! So it's a great dataset to train and practice your data science skills!!

This notebook will cover:

  1. Data cleansing / preparation
  2. Performance of multiple models (LightGBM):
    1. Baseline
    2. Creating Label Encoding features
    3. Feature selection (Random Forest)
    4. Noise
In [1]:
import pandas as pd
from urllib.request import urlopen  
import os.path as osp
import os
import logging
import zipfile
from glob import glob
logging.getLogger().setLevel('INFO')

Helpers

In [2]:
def download_file(url_str, path):
    # Context managers ensure the connection and file are closed even on error
    with urlopen(url_str) as url, open(path, 'wb') as output:
        output.write(url.read())

def extract_file(archive_path, target_dir):
    with zipfile.ZipFile(archive_path, 'r') as zip_file:
        zip_file.extractall(target_dir)

Download the dataset

In [3]:
BASE_URL = 'http://tennis-data.co.uk'
DATA_DIR = "tennis_data"
ATP_DIR = './{}/ATP'.format(DATA_DIR)
WTA_DIR = './{}/WTA'.format(DATA_DIR)

ATP_URLS = [BASE_URL + "/%i/%i.zip" % (i,i) for i in range(2000,2019)]
WTA_URLS = [BASE_URL + "/%iw/%i.zip" % (i,i) for i in range(2007,2019)]

os.makedirs(osp.join(ATP_DIR, 'archives'), exist_ok=True)
os.makedirs(osp.join(WTA_DIR, 'archives'), exist_ok=True)

for files, directory in ((ATP_URLS, ATP_DIR), (WTA_URLS, WTA_DIR)):
    for dl_path in files:
        logging.info("downloading & extracting file %s", dl_path)
        archive_path = osp.join(directory, 'archives', osp.basename(dl_path))
        download_file(dl_path, archive_path)
        extract_file(archive_path, directory)
    
ATP_FILES = sorted(glob("%s/*.xls*" % ATP_DIR))
WTA_FILES = sorted(glob("%s/*.xls*" % WTA_DIR))

df_atp = pd.concat([pd.read_excel(f) for f in ATP_FILES], ignore_index=True)
df_wta = pd.concat([pd.read_excel(f) for f in WTA_FILES], ignore_index=True)

logging.info("%i matches ATP in df_atp", df_atp.shape[0])
logging.info("%i matches WTA in df_wta", df_wta.shape[0])
In [4]:
try:
    df4 = pd.read_csv("../input/data-4q/data_4q .csv", index_col=0, low_memory=False)
except FileNotFoundError:
    print("It looks like you forgot to upload the dataset!")
In [5]:
df_atp.describe()
Out[5]:
ATP Best of W1 L1 W4 L4 W5 L5 Wsets CBW ... UBW UBL LBW LBL SJW SJL MaxW MaxL AvgW AvgL
count 52298.000000 52298.000000 52035.000000 52037.000000 4731.000000 4731.000000 1791.000000 1791.000000 52074.000000 17506.000000 ... 10671.000000 10671.000000 28131.000000 28142.000000 15572.000000 15579.000000 22745.000000 22745.000000 22745.000000 22745.000000
mean 33.222532 3.372366 5.794331 4.056229 5.777003 3.863454 6.637633 3.756002 2.141760 1.812080 ... 1.815867 3.542479 1.810226 3.451461 1.796538 3.557943 1.998610 8.326076 1.834821 3.594448
std 18.115493 0.778516 1.239577 1.845206 1.274712 1.895683 2.290596 2.817183 0.460311 0.868254 ... 0.996238 3.646316 1.031691 3.075889 1.004273 3.272510 1.628982 397.235666 1.107884 3.282610
min 1.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 ... 1.010000 1.020000 1.000000 1.000000 1.000000 1.010000 1.010000 1.010000 1.010000 1.010000
25% 19.000000 3.000000 6.000000 3.000000 6.000000 2.000000 6.000000 2.000000 2.000000 1.280000 ... 1.240000 1.750000 1.250000 1.730000 1.220000 1.730000 1.290000 1.850000 1.240000 1.740000
50% 33.000000 3.000000 6.000000 4.000000 6.000000 4.000000 6.000000 3.000000 2.000000 1.550000 ... 1.500000 2.500000 1.500000 2.500000 1.500000 2.630000 1.570000 2.780000 1.500000 2.550000
75% 49.000000 3.000000 6.000000 6.000000 6.000000 6.000000 7.000000 5.000000 2.000000 2.050000 ... 2.030000 3.850000 2.000000 4.000000 2.000000 4.000000 2.200000 4.540000 2.060000 3.990000
max 69.000000 5.000000 7.000000 7.000000 7.000000 7.000000 70.000000 68.000000 3.000000 14.000000 ... 18.000000 60.000000 26.000000 51.000000 19.000000 81.000000 76.000000 42586.000000 23.450000 36.440000

8 rows × 36 columns

In [6]:
df4.loc[:,df4.columns[18:]]
Out[6]:
L3 W4 L4 W5 L5 Wsets Lsets CBW CBL GBW ... SJL MaxW MaxL AvgW AvgL Labels Player1 Oponent winner_past_vict loser_past_vict
0 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 76.00 42586.00 23.45 36.44 1 Dosedel S. Ljubicic I. 0.000000 0.000000
1 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 76.00 42586.00 23.45 36.44 0 Clement A. Enqvist T. 0.000000 0.000000
2 3.0 7.0 7.0 70.0 68.0 2.0 1.0 14.0 25.0 7.5 ... 81.0 76.00 42586.00 23.45 36.44 0 Baccanello P. Escude N. 0.000000 0.000000
3 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 76.00 42586.00 23.45 36.44 1 Federer R. Knippschild J. 0.000000 0.000000
4 4.0 7.0 7.0 70.0 68.0 2.0 1.0 14.0 25.0 7.5 ... 81.0 76.00 42586.00 23.45 36.44 1 Fromberg R. Woodbridge T. 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52293 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 1.44 3.40 1.38 3.14 0 Isner J. Zverev A. 0.664032 0.622705
52294 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 1.22 6.03 1.17 5.14 0 Cilic M. Djokovic N. 0.829876 0.652047
52295 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 3.40 1.45 3.14 1.38 0 Federer R. Zverev A. 0.665354 0.829142
52296 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 1.15 7.72 1.12 6.52 0 Anderson K. Djokovic N. 0.830052 0.589792
52297 7.0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 ... 81.0 6.36 1.22 5.69 1.15 0 Djokovic N. Zverev A. 0.666667 0.830228

52298 rows × 40 columns

Win percentage up to the current match

It could be interesting to calculate, every time a player plays a match, his winning percentage over his previous matches. To do so, we can run the cells below. But as this takes some time (10~15 min), I'll leave them commented out!

In [7]:
# def prior_wins_percentage(date, player, df, min_games=5):
    
#     df_prior = df[df["Date"] < date]
#     prior_wins = df_prior[df_prior["Winner"] == player].shape[0]
#     prior_losses = df_prior[df_prior["Loser"] == player].shape[0]
    
#     # We set a minimum number of games to avoid extra-high win rates
#     # (e.g. like at a professional career debut)
#     if (prior_wins + prior_losses) < min_games:
#         return 0
#     return prior_wins / (prior_wins + prior_losses)
In [8]:
# df_atp["winner_past_vict"] = df_atp.apply(lambda x: prior_wins_percentage(x["Date"],x["Winner"],df_atp),axis=1)
# df_atp["loser_past_vict"] = df_atp.apply(lambda x: prior_wins_percentage(x["Date"],x["Loser"],df_atp),axis=1)
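The commented-out approach above recomputes the full match history for every row, which is why it takes 10~15 minutes. A single chronological pass with a per-player tally is far faster. This is only a sketch (the function name `running_win_percentage` is mine, and unlike the strict `Date <` filter above it also counts earlier same-day rows as prior matches):

```python
import pandas as pd
from collections import defaultdict

def running_win_percentage(df, min_games=5):
    """One chronological pass over the matches: for every row, return the
    winner's and loser's win rate over their PRIOR matches
    (0.0 until `min_games` matches have been played)."""
    wins, games = defaultdict(int), defaultdict(int)
    w_rates = pd.Series(0.0, index=df.index)
    l_rates = pd.Series(0.0, index=df.index)
    for row in df.sort_values("Date").itertuples():
        for player, rates in ((row.Winner, w_rates), (row.Loser, l_rates)):
            # Require a minimum history to avoid inflated early-career rates
            if games[player] >= max(min_games, 1):
                rates[row.Index] = wins[player] / games[player]
        # Update the tallies AFTER recording the prior rates
        wins[row.Winner] += 1
        games[row.Winner] += 1
        games[row.Loser] += 1
    return w_rates, l_rates
```

The two columns could then be assigned in one call, e.g. `df_atp["winner_past_vict"], df_atp["loser_past_vict"] = running_win_percentage(df_atp)`.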
In [9]:
#data.to_csv("data_4q.csv")

Checking (uncomment the next cell)

In [10]:
#df_atp.tail()

Match prediction

To predict who will win a match, we define 3 main functions:

  1. get_data_splits: splits the dataset
  2. train_model: trains the model
  3. standard: standardizes the features
In [11]:
import lightgbm as lgb
from sklearn import metrics
from sklearn.preprocessing import StandardScaler, RobustScaler, Normalizer
from sklearn.metrics import confusion_matrix,accuracy_score, roc_curve, auc
import seaborn as sns
import matplotlib.pyplot as plt 

conf = {}
acu = {}
roc = {}
y_predictions = {}
def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)

    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    print(f"Train size : {len(train)}\nValidation size : {len(valid)}\nTest size : {len(test)}")
    return train, valid, test


def train_model(train, valid, test, over, n):

    feature_cols = train.columns.drop('Labels')

    dtrain = lgb.Dataset(train[feature_cols], label=train['Labels'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['Labels'])
    
    if over:
        param = {'num_leaves': 31, 'objective': 'binary', "max_depth": 3,
             'metric': 'auc', 'seed': 7, 'reg_alpha':0.5, 'reg_lambda':0.5}
        print(f"Regularization : l1 = {param['reg_alpha']}, l2 = {param['reg_lambda']}")
    else:
        param = {'num_leaves': 64, 'objective': 'binary',
             'metric': 'auc', 'seed': 7}
        print(f"No regularization!")
        
    evals_result = {} 
    bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid,dtrain], 
                    early_stopping_rounds=10, verbose_eval=10, evals_result=evals_result)
    nameModel = "Model " + str(n) +".txt"
    import joblib
    # save model
    joblib.dump(bst, nameModel)
    
    evaluate(bst,feature_cols,evals_result,n)


def evaluate(bst,feature_cols,evals_result,n):
    # Note: relies on the global `valid` and `test` splits from get_data_splits
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['Labels'], valid_pred)
    
    test_pred = bst.predict(test[feature_cols])
    test_score = metrics.roc_auc_score(test['Labels'], test_pred)
    
    print(f"Validation AUC score: {valid_score:.4f}")
    print(f"Test AUC score: {test_score:.4f}")
    
    plot(evals_result,valid_pred,bst,feature_cols,test[feature_cols],n)
    
    

def plot(evals_result,valid_pred,bst,feature_cols,test,n):
    global conf
    global acu
    global roc
    global y_predictions
    if evals_result != None:
        acu[n] = evals_result
        fig1 = plt.figure(figsize=(45,10))
    #print('Plot metrics during training... Our metric : ', param["metric"])
    #print("evals_ results : ", evals_result)
        lgb.plot_metric(evals_result, metric='auc',figsize=(35,10))
        plt.xlabel('Iterations',fontsize=20)
        plt.ylabel('auc',fontsize=20)
        plt.xticks(fontsize=20)
        plt.yticks(fontsize=20)
        plt.title("AUC during training",fontsize=20)
        plt.legend(fontsize=20)
        plt.show()


    ##### CONFUSION MATRIX
    th = 0.5
    y_pred_class = valid_pred > th
    y_predictions[n] = y_pred_class
    cm = confusion_matrix(valid["Labels"], y_pred_class)
    tn, fp, fn, tp = cm.ravel()
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    tnr = tn / (tn + fp)
    tpr = tp / (tp + fn)
    numberModel = n
    conf[n] = {'fpr':f'{fpr:.3f}','fnr': f'{fnr:.3f}', 'tnr' : f'{tnr:.3f}', "tpr": f'{tpr:.3f}'}
    if n > 1 and fpr != 0 and fnr != 0 and tnr != 0 and tpr != 0:
        conf["ratio " + str(n) + "/" + str(n-1)] = {"fp":f'{float(conf[n]["fpr"])/float(conf[n-1]["fpr"]):.3f}', \
                                                    "fn":f'{float(conf[n]["fnr"])/float(conf[n-1]["fnr"]):.3f}', \
                                                    "tn":f'{float(conf[n]["tnr"])/float(conf[n-1]["tnr"]):.3f}', \
                                                    "tp":f'{float(conf[n]["tpr"])/float(conf[n-1]["tpr"]):.3f}'}
    
    fig2 = plt.figure(figsize=(35,10))
    fig2.add_subplot(1,2,1)
    sns.heatmap(cm, annot = True, fmt='d', cmap="Blues", vmin = 0.2,linewidths=.5,annot_kws={"fontsize": 20}); #cbar_kws={"fontsize": 20},annot_kws={"fontsize": 20}
    sns.set(font_scale=2)
    plt.title('Confusion Matrix',fontsize=20)
    plt.ylabel('True Class',fontsize=20)
    plt.xlabel('Predicted Class',fontsize=20)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    plt.text(0.1, 0.3, f' FPR: {fpr:.3f}\n FNR: {fnr:.3f}\n TNR: {tnr:.3f}\n TPR: {tpr:.3f}', style='italic',
    bbox={'facecolor': 'white', 'alpha': 0.7, 'pad': 5}, fontsize=14)
    
    
    #Print Area Under Curve
    fig2.add_subplot(1,2,2)

    false_positive_rate, recall, thresholds = roc_curve(valid["Labels"], valid_pred)
    roc_auc = auc(false_positive_rate, recall)
    roc[n] = {'fpr':false_positive_rate,'recall':recall}
    
    plt.title('Receiver Operating Characteristic (ROC)')
    plt.plot(false_positive_rate, recall, 'b', label = 'AUC = %0.3f' %roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1], [0,1], 'r--')
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.0])
    plt.ylabel('Recall',fontsize=20)
    plt.xlabel('Fall-out (1-Specificity)',fontsize=20)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    
    plt.show()    
    display(conf)
    
def standard(df,feat):
    scaler = StandardScaler()
    # fit on every column except the label, then write the scaled values back
    df[[*feat]] = pd.DataFrame(scaler.fit_transform(df.drop(["Labels"], axis=1)),columns=feat)
    return df
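A caveat: `standard` fits the scaler on the entire dataset, so statistics from the validation and test rows leak into the transformation applied to the training rows. A leakage-free variant, sketched below under the assumption that it is called on the splits returned by `get_data_splits` (the name `standard_no_leak` is mine), fits on the training fold only:

```python
from sklearn.preprocessing import StandardScaler

def standard_no_leak(train, valid, test, feat):
    """Fit the scaler on the training fold only, then apply its
    statistics to the validation and test folds (no leakage)."""
    train, valid, test = train.copy(), valid.copy(), test.copy()
    scaler = StandardScaler()
    train[feat] = scaler.fit_transform(train[feat])
    valid[feat] = scaler.transform(valid[feat])  # reuse training statistics
    test[feat] = scaler.transform(test[feat])
    return train, valid, test
```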
In [12]:
df_atp.drop("Comment", axis=1, inplace=True) #### Let's get rid of this column since it doesn't seem too useful for us...
# df_atp.columns

Here is where we create the base from which our model will try to understand what is going on and predict whether a given row, with all the player variables, results in a "win" or a "loss" (1 or 0). But note what happens when we swap only the name columns (that is to say, we leave all the other variables, "WRank", "LPts", etc., untouched). You can imagine this will interfere with the learning process, as the model will struggle to tell which player the variables belong to once the names are swapped in only some rows.

In [13]:
# "Winner" < "Loser" compares the names alphabetically: Player1 will always be
# the alphabetically-first player, and Labels == 1 means that Player1 won
df_atp["Labels"] = df_atp.apply(lambda row: 1 if row["Winner"] < row["Loser"] else 0, axis=1)
df_atp["Player1"] = df_atp.apply(lambda row: row["Winner"] if row["Winner"] < row["Loser"] else row["Loser"], axis=1)
df_atp["Oponent"] = df_atp.apply(lambda row: row["Loser"] if row["Winner"] < row["Loser"] else row["Winner"], axis=1)
display(df_atp[["Winner", "Loser", "Labels","Player1","Oponent"]].head(5))
Winner Loser Labels Player1 Oponent
0 Dosedel S. Ljubicic I. 1 Dosedel S. Ljubicic I.
1 Enqvist T. Clement A. 0 Clement A. Enqvist T.
2 Escude N. Baccanello P. 0 Baccanello P. Escude N.
3 Federer R. Knippschild J. 1 Federer R. Knippschild J.
4 Fromberg R. Woodbridge T. 1 Fromberg R. Woodbridge T.

We can see that this method doesn't leave our dataset unbalanced:

In [14]:
print(df_atp[df_atp["Labels"] == 1].shape[0])
print(df_atp[df_atp["Labels"] == 0].shape[0])
26602
25696

Now let's check for possible errors that may be hidden somewhere in the middle of our dataset!! To do so, we'll inspect the columns represented by "dtype=object"!!

In [15]:
cols = df_atp.columns[df_atp.dtypes.eq(object)]
print(cols)
Index(['Location', 'Tournament', 'Series', 'Court', 'Surface', 'Round',
       'Winner', 'Loser', 'WRank', 'LRank', 'W2', 'L2', 'W3', 'L3', 'Lsets',
       'EXW', 'Player1', 'Oponent'],
      dtype='object')

Nice!! But let's keep the first 6 for now, as I want to create categorical features from them for the 2nd model!

In [16]:
#df_atp.head()
In [17]:
cols.drop(["Location","Tournament", "Series","Court","Surface","Player1","Oponent","Round"])
Out[17]:
Index(['Winner', 'Loser', 'WRank', 'LRank', 'W2', 'L2', 'W3', 'L3', 'Lsets',
       'EXW'],
      dtype='object')
In [18]:
display(df_atp.isnull().sum())
ATP               0
Location          0
Tournament        0
Date              0
Series            0
Court             0
Surface           0
Round             0
Best of           0
Winner            0
Loser             0
WRank            15
LRank            78
W1              263
L1              261
W2              772
L2              771
W3            28129
L3            28130
W4            47567
L4            47567
W5            50507
L5            50507
Wsets           224
Lsets           225
CBW           34792
CBL           34792
GBW           47243
GBL           47243
IWW           38940
IWL           38940
SBW           46874
SBL           46874
B365W          8655
B365L          8632
B&WW          51201
B&WL          51201
EXW           12887
EXL           12882
PSW           14959
PSL           14959
WPts          16204
LPts          16263
UBW           41627
UBL           41627
LBW           24167
LBL           24156
SJW           36726
SJL           36719
MaxW          29553
MaxL          29553
AvgW          29553
AvgL          29553
Labels            0
Player1           0
Oponent           0
dtype: int64
In [19]:
cols = df_atp.columns[11:].drop(["Labels","Player1","Oponent"])
cols
Out[19]:
Index(['WRank', 'LRank', 'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5',
       'L5', 'Wsets', 'Lsets', 'CBW', 'CBL', 'GBW', 'GBL', 'IWW', 'IWL', 'SBW',
       'SBL', 'B365W', 'B365L', 'B&WW', 'B&WL', 'EXW', 'EXL', 'PSW', 'PSL',
       'WPts', 'LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW', 'SJL', 'MaxW',
       'MaxL', 'AvgW', 'AvgL'],
      dtype='object')

So let's grab all the columns with errors and apply a function to coerce the errors to "NaN" values. That way we can get rid of multiple kinds of errors, such as "30,,2", or a value that is nothing but a blank string " " where a number should be. They will all be turned into "NaN", and from there it is straightforward to handle them!

In [20]:
df_atp.loc[:, cols] = df_atp.loc[:, cols].apply(pd.to_numeric, errors='coerce') 

You can do whatever you want with the NaN values (replace them with each column's mean, max, min, etc.). For simplicity, I'll replace all of them with the maximum value found in each column:

In [21]:
df_atp = df_atp.dropna(axis=1, how="all")

for each in cols:
    df_atp[each] = df_atp[each].fillna(df_atp[each].max())
df_atp.head()
Out[21]:
ATP Location Tournament Date Series Court Surface Round Best of Winner ... LBL SJW SJL MaxW MaxL AvgW AvgL Labels Player1 Oponent
0 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Dosedel S. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Dosedel S. Ljubicic I.
1 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Enqvist T. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 0 Clement A. Enqvist T.
2 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Escude N. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 0 Baccanello P. Escude N.
3 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Federer R. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Federer R. Knippschild J.
4 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Fromberg R. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Fromberg R. Woodbridge T.

5 rows × 56 columns

As a sanity check:

In [22]:
for each in cols[-5:]:
    print(f"max value for {each} : {df_atp[each].max()}")
max value for SJL : 81.0
max value for MaxW : 76.0
max value for MaxL : 42586.0
max value for AvgW : 23.45
max value for AvgL : 36.44
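Filling with the column maximum is a debatable choice here, since it drags every missing value toward extremes such as `MaxL = 42586.0`. As an alternative sketch (the helper name `fill_with_median` is mine), the per-column median is a more conventional imputation:

```python
def fill_with_median(df, cols):
    """Replace NaN in the given numeric columns with each column's median."""
    out = df.copy()
    # DataFrame.fillna with a Series fills each column with its own median
    out[cols] = out[cols].fillna(out[cols].median())
    return out
```

It would be used as `df_atp = fill_with_median(df_atp, cols)` in place of the loop above.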

As we'll have 4 models, I'll define a function to drop chosen columns:

In [23]:
#dt = df4.copy()
dt = df_atp.copy()

def drop_cols(df): ### To make this easier, since we planned to repeat this operation several times for the other models.
    drop_col = [x for x in df.columns[0:8]]+["Winner",'Loser',"Player1","Oponent"]
    df.drop(drop_col,axis=1,inplace=True)
In [24]:
#dt.tail()
In [25]:
drop_cols(dt)
# dt.head()
# dt.columns

Standardize!

In [26]:
st_feat = dt.columns.drop("Labels") ## the features to standardize!
st_feat
Out[26]:
Index(['Best of', 'WRank', 'LRank', 'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4',
       'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'CBW', 'CBL', 'GBW', 'GBL', 'IWW',
       'IWL', 'SBW', 'SBL', 'B365W', 'B365L', 'B&WW', 'B&WL', 'EXW', 'EXL',
       'PSW', 'PSL', 'WPts', 'LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW', 'SJL',
       'MaxW', 'MaxL', 'AvgW', 'AvgL'],
      dtype='object')
In [27]:
dt = standard(dt,st_feat)
dt.head()
Out[27]:
Best of WRank LRank W1 L1 W2 L2 W3 L3 W4 ... UBL LBW LBL SJW SJL MaxW MaxL AvgW AvgL Labels
0 -0.478307 0.040706 -0.130024 0.161053 -0.038289 0.166416 -1.010739 0.562662 0.753292 0.212907 ... 0.504987 1.076783 1.0745 0.649582 0.650538 0.876912 0.87722 0.875256 0.869636 1
1 -0.478307 -0.681895 -0.270941 0.161053 -0.578179 0.166416 -0.484007 0.562662 0.753292 0.212907 ... 0.504987 1.076783 1.0745 0.649582 0.650538 0.876912 0.87722 0.875256 0.869636 0
2 -0.478307 -0.245843 3.748563 0.161053 1.581379 0.962612 0.569456 -0.691273 -1.176284 0.212907 ... 0.504987 1.076783 1.0745 0.649582 0.650538 0.876912 0.87722 0.875256 0.869636 0
3 -0.478307 0.065624 -0.062920 0.161053 -1.657957 0.166416 0.042724 0.562662 0.753292 0.212907 ... 0.504987 1.076783 1.0745 0.649582 0.650538 0.876912 0.87722 0.875256 0.869636 1
4 -0.478307 0.264962 0.681930 0.967906 1.041489 -0.629780 1.622919 -0.691273 -0.693890 0.212907 ... 0.504987 1.076783 1.0745 0.649582 0.650538 0.876912 0.87722 0.875256 0.869636 1

5 rows × 44 columns

Finally, let's set up the dataset for training and get the validation AUC score:

Performance baseline Model :

In [28]:
dt.to_csv("data_model1.csv")
In [29]:
train,valid, test = get_data_splits(dt)
over = False ### Flag indicating whether we are overfitting (if True, regularization is applied)!
train_model(train,valid,test,over,1)
Train size : 41840
Validation size : 5229
Test size : 5229
No regularization!
Training until validation scores don't improve for 10 rounds
[10]	training's auc: 0.638469	valid_0's auc: 0.487855
Early stopping, best iteration is:
[4]	training's auc: 0.61001	valid_0's auc: 0.502066
Validation AUC score: 0.5021
Test AUC score: 0.5167
<Figure size 3240x720 with 0 Axes>
{1: {'fpr': '0.668', 'fnr': '0.315', 'tnr': '0.332', 'tpr': '0.685'}}

Surprised by the results? Quite a poor performance, isn't it? Let's try to understand it.

  1. As we pointed out, we didn't swap all the columns according to the player. So the model doesn't know whether "WRank" now belongs to "Player1" or to the "Oponent".

So let's now try to help our model by creating other types of variables. Let's increase the complexity of our model. We can expect a better performance.
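For reference, actually fixing the swapping issue would mean swapping the winner/loser feature columns whenever the names were swapped, so that every variable follows `Player1` instead of the match winner. A minimal sketch of the idea (the helper and the `P1Rank`/`OpRank` output names are mine; it relies on the notebook's convention that `Labels == 1` means `Player1` won):

```python
import pandas as pd

def swap_pairs(df, pairs):
    """Re-express winner/loser column pairs as Player1/Oponent columns.
    `pairs` maps e.g. ('WRank', 'LRank') to ('P1Rank', 'OpRank').
    When Labels == 1, Player1 is the winner, so the W column belongs
    to Player1; otherwise the columns are swapped."""
    out = df.copy()
    p1_is_winner = out["Labels"] == 1
    for (w_col, l_col), (p1_col, op_col) in pairs.items():
        out[p1_col] = out[w_col].where(p1_is_winner, out[l_col])
        out[op_col] = out[l_col].where(p1_is_winner, out[w_col])
    return out
```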

Categorization: Label Encoding

Let's see now whether creating categorical features with LabelEncoder will help our model understand the data better.

In [30]:
#dt = df4.copy()
dt = df_atp.copy()
dt.head()
Out[30]:
ATP Location Tournament Date Series Court Surface Round Best of Winner ... LBL SJW SJL MaxW MaxL AvgW AvgL Labels Player1 Oponent
0 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Dosedel S. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Dosedel S. Ljubicic I.
1 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Enqvist T. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 0 Clement A. Enqvist T.
2 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Escude N. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 0 Baccanello P. Escude N.
3 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Federer R. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Federer R. Knippschild J.
4 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Fromberg R. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Fromberg R. Woodbridge T.

5 rows × 56 columns

But first, you may ask: why categorize these features? How could they help the algorithm better predict the outcome of a match? At first glance, we might not see how a variable giving the city of the match could influence the outcome...

But how well would you perform at La Javaness if you had to work in a small 3x4 room in 36° heat with 40% humidity? Would you perform the same? (a Brazilian perhaps would XD)

In the same vein, would you be able to spot the qualities and added value of this very script (yes, I do mean this whole script written by a candidate for a DS position at La Javaness) if you reviewed it in December? (wouldn't you be tempted by the holidays or a good Christmas dinner? =) The date can matter, not only for each player personally, but also for their professional performance.) Similar considerations can be made for the other variables.

We are not claiming that the information above is actually present in the dataset. It should rather be considered implicit, and treated as a hypothesis rather than an assertion.

In [31]:
from sklearn import preprocessing
cat = ["Location", "Tournament", "Date", "Series", "Court", "Surface", "Round", "Player1", "Oponent"]

# encode each categorical column as integer codes in a new "<col>_cat" column
for each in cat:
    le = preprocessing.LabelEncoder()
    name = each + "_cat"
    dt[name] = le.fit_transform(dt[each])
dt.head()
Out[31]:
ATP Location Tournament Date Series Court Surface Round Best of Winner ... Oponent Location_cat Tournament_cat Date_cat Series_cat Court_cat Surface_cat Round_cat Player1_cat Oponent_cat
0 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Dosedel S. ... Ljubicic I. 2 18 0 3 1 3 0 329 514
1 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Enqvist T. ... Enqvist T. 2 18 0 3 1 3 0 236 187
2 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Escude N. ... Escude N. 2 18 0 3 1 3 0 68 192
3 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Federer R. ... Knippschild J. 2 18 0 3 1 3 0 377 430
4 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Fromberg R. ... Woodbridge T. 2 18 0 3 1 3 0 405 1129

5 rows × 65 columns

In [32]:
drop_cols(dt)
#dt.head()
In [33]:
st_feat = dt.columns.drop("Labels")
print(st_feat)
dt = standard(dt,st_feat)
dt.head()
Index(['Best of', 'WRank', 'LRank', 'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4',
       'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'CBW', 'CBL', 'GBW', 'GBL', 'IWW',
       'IWL', 'SBW', 'SBL', 'B365W', 'B365L', 'B&WW', 'B&WL', 'EXW', 'EXL',
       'PSW', 'PSL', 'WPts', 'LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW', 'SJL',
       'MaxW', 'MaxL', 'AvgW', 'AvgL', 'Location_cat', 'Tournament_cat',
       'Date_cat', 'Series_cat', 'Court_cat', 'Surface_cat', 'Round_cat',
       'Player1_cat', 'Oponent_cat'],
      dtype='object')
Out[33]:
Best of WRank LRank W1 L1 W2 L2 W3 L3 W4 ... Labels Location_cat Tournament_cat Date_cat Series_cat Court_cat Surface_cat Round_cat Player1_cat Oponent_cat
0 -0.478307 0.040706 -0.130024 0.161053 -0.038289 0.166416 -1.010739 0.562662 0.753292 0.212907 ... 1 -1.865034 -1.425432 -1.321557 0.19432 0.466801 0.875723 -0.7131 -0.392741 -0.596151
1 -0.478307 -0.681895 -0.270941 0.161053 -0.578179 0.166416 -0.484007 0.562662 0.753292 0.212907 ... 0 -1.865034 -1.425432 -1.321557 0.19432 0.466801 0.875723 -0.7131 -0.714569 -1.673817
2 -0.478307 -0.245843 3.748563 0.161053 1.581379 0.962612 0.569456 -0.691273 -1.176284 0.212907 ... 0 -1.865034 -1.425432 -1.321557 0.19432 0.466801 0.875723 -0.7131 -1.295934 -1.657339
3 -0.478307 0.065624 -0.062920 0.161053 -1.657957 0.166416 0.042724 0.562662 0.753292 0.212907 ... 1 -1.865034 -1.425432 -1.321557 0.19432 0.466801 0.875723 -0.7131 -0.226637 -0.872982
4 -0.478307 0.264962 0.681930 0.967906 1.041489 -0.629780 1.622919 -0.691273 -0.693890 0.212907 ... 1 -1.865034 -1.425432 -1.321557 0.19432 0.466801 0.875723 -0.7131 -0.129743 1.430653

5 rows × 53 columns

In [34]:
dt.to_csv("data_model2.csv")

Training performance of Model 2:

In [35]:
train,valid, test = get_data_splits(dt)
train_model(train,valid,test,over,2)
Train size : 41840
Validation size : 5229
Test size : 5229
No regularization!
Training until validation scores don't improve for 10 rounds
[10]	training's auc: 0.773974	valid_0's auc: 0.66868
[20]	training's auc: 0.825064	valid_0's auc: 0.711638
[30]	training's auc: 0.860634	valid_0's auc: 0.726145
Early stopping, best iteration is:
[26]	training's auc: 0.84956	valid_0's auc: 0.72778
Validation AUC score: 0.7278
Test AUC score: 0.6098
<Figure size 3240x720 with 0 Axes>
{1: {'fpr': '0.668', 'fnr': '0.315', 'tnr': '0.332', 'tpr': '0.685'},
 2: {'fpr': '0.340', 'fnr': '0.352', 'tnr': '0.660', 'tpr': '0.648'},
 'ratio 2/1': {'fp': '0.509', 'fn': '1.117', 'tn': '1.988', 'tp': '0.946'}}

So we increased our validation score. As we expected, right? If we compare the values from the confusion matrix we can learn that:

  1. we basically kept the same rate of true positives (we are predicting "1" as "1"): "ratio 2/1" --> 'tp' near 1.0 (0.946)
  2. we are now predicting the negative cases more accurately ("0" as "0"): we almost doubled the TNR, as shown by "ratio 2/1" --> 'tn' near 2.0
  3. we are now better at telling which samples AREN'T positive: we almost halved our FPR, so we are more likely to recognize that samples with label "0" are NOT "1"
  4. we are missing slightly more positive cases, classifying them as negative.

We can conclude that this model was more sensitive to the negative cases. It understood better that positive cases were not negative (lower FPR) and that negative cases were indeed negative (higher TNR)

However, we still have a very poor test score!! Can you tell me why? Well, since the test set is also data never seen by our model, it is quite normal for the model to miss predictions when it is learning from bad information (in our case, the messy non-swapped columns). Moreover, if you take into account that some players only appear in the test data (check the Facets Overview visualization in my previous notebook: https://www.kaggle.com/danielfmfurlan/eda-tennis), this becomes even clearer: bad learning during training leads to bad predictions on unseen data!
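That last point can be quantified directly: count the distinct players in the test period who never appear in the training period. A small sketch (the helper name is mine; it would be fed the `Player1`/`Oponent` name columns of the corresponding row ranges, before those columns are dropped):

```python
def unseen_player_fraction(train_names, test_names):
    """Fraction of distinct test-period players with no match in training."""
    seen = set(train_names)
    test_players = set(test_names)
    return len(test_players - seen) / len(test_players)
```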

3rd model: discarding some features with a Random Forest classifier

Let's give our model another chance and try to keep only the most important features (let's see whether the Random Forest can tell that a lot of the information fed to our model is actually disrupting the learning and should be discarded).

If you want a visual insight into the correlation between the features, uncomment the next cell to see a heatmap:

In [36]:
# cor = dt.corr()
# import matplotlib.pyplot as plt     
# import seaborn as sns; sns.set()
# plt.figure(figsize=(28,20))

# import numpy as np 
# mask = np.zeros_like(cor)
# mask[np.triu_indices_from(mask)] = True

# with sns.axes_style("white"):
#     f, ax = plt.subplots(figsize=(28, 20))
#     ax = sns.heatmap(cor, center = 0, linewidth = 0.9, vmin = -1, vmax = 1, 
#     cmap =  sns.color_palette("RdBu_r", 7),annot = False, mask=mask, square=True, fmt='.g')
    
# corr_m = cor.abs()
# sol = (corr_m.where(np.triu(np.ones(corr_m.shape), k=1).astype(bool))
#                  .stack()
#                  .sort_values(ascending=False))
# print("the 10 most strongly correlated variable pairs: \n", sol[:10])
In [37]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt  
import matplotlib.dates as mdates
import matplotlib.cbook as cbook

X = dt.drop("Labels", axis=1)
display(X.columns)
y=dt.loc[:, "Labels"]

feats = RandomForestClassifier(n_jobs=-1)
feats.fit(X, y)

plt.figure(figsize=(10, 10))
imp = feats.feature_importances_
cols = dt.columns.drop("Labels")

imp, cols = zip(*sorted(zip(imp, cols)))

plt.barh(range(len(cols)), imp, align="center", color='blue');
plt.yticks(range(len(cols)), cols,fontsize=10)
plt.xticks(fontsize=10)
plt.title("Important variables for classification")
plt.xlabel("Relevance (%)",fontsize=12)
plt.tight_layout();

import numpy as np
th = 0.025
imp = np.array(imp)
most = cols[- np.where(imp > th)[0].shape[0] :]

print(f'The most important variables influencing the result:\n {most}')
Index(['Best of', 'WRank', 'LRank', 'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4',
       'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'CBW', 'CBL', 'GBW', 'GBL', 'IWW',
       'IWL', 'SBW', 'SBL', 'B365W', 'B365L', 'B&WW', 'B&WL', 'EXW', 'EXL',
       'PSW', 'PSL', 'WPts', 'LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW', 'SJL',
       'MaxW', 'MaxL', 'AvgW', 'AvgL', 'Location_cat', 'Tournament_cat',
       'Date_cat', 'Series_cat', 'Court_cat', 'Surface_cat', 'Round_cat',
       'Player1_cat', 'Oponent_cat'],
      dtype='object')
The most important variables influencing the result:
 ('PSW', 'PSL', 'LPts', 'WPts', 'Location_cat', 'Tournament_cat', 'Date_cat', 'LRank', 'WRank', 'Oponent_cat', 'Player1_cat')
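The selection above relies on a small indexing trick: since `imp` and `cols` were sorted together in ascending order of importance, every importance above the threshold sits at the end of the tuple. A minimal sketch with made-up importances and feature names:

```python
import numpy as np

# Importances and names sorted together, ascending (as in the cell above).
imp = np.array([0.01, 0.02, 0.05, 0.30, 0.62])
cols = ("W3", "Court_cat", "WRank", "PSL", "PSW")

th = 0.025
# Because imp is sorted ascending, all values above the threshold sit at the
# end, so taking the last `count` names selects exactly those features.
count = int(np.sum(imp > th))
most = cols[-count:]
print(most)  # ('WRank', 'PSL', 'PSW')
```

`int(np.sum(imp > th))` is equivalent to the `np.where(imp > th)[0].shape[0]` expression used in the cell above.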

So now we are taking 50% good information and 50% messy information (5 of these variables come from the categorization and are therefore immune to the column-swapping problem). We expect a better performance!

In [38]:
dt = dt[[*(most + ("Labels",))]]
dt.head(5)
Out[38]:
PSW PSL LPts WPts Location_cat Tournament_cat Date_cat LRank WRank Oponent_cat Player1_cat Labels
0 1.577375 1.573635 1.474795 1.439918 -1.865034 -1.425432 -1.321557 -0.130024 0.040706 -0.596151 -0.392741 1
1 1.577375 1.573635 1.474795 1.439918 -1.865034 -1.425432 -1.321557 -0.270941 -0.681895 -1.673817 -0.714569 0
2 1.577375 1.573635 1.474795 1.439918 -1.865034 -1.425432 -1.321557 3.748563 -0.245843 -1.657339 -1.295934 0
3 1.577375 1.573635 1.474795 1.439918 -1.865034 -1.425432 -1.321557 -0.062920 0.065624 -0.872982 -0.226637 1
4 1.577375 1.573635 1.474795 1.439918 -1.865034 -1.425432 -1.321557 0.681930 0.264962 1.430653 -0.129743 1
In [39]:
dt.to_csv("data_model3.csv")

Training performance of model 3:

In [40]:
train,valid, test = get_data_splits(dt)
train_model(train,valid,test,over,3)
Train size : 41840
Validation size : 5229
Test size : 5229
No regularization!
Training until validation scores don't improve for 10 rounds
[10]	training's auc: 0.778073	valid_0's auc: 0.666758
[20]	training's auc: 0.827065	valid_0's auc: 0.701635
[30]	training's auc: 0.854424	valid_0's auc: 0.731156
[40]	training's auc: 0.872808	valid_0's auc: 0.735153
[50]	training's auc: 0.888644	valid_0's auc: 0.737633
[60]	training's auc: 0.89961	valid_0's auc: 0.740385
[70]	training's auc: 0.909744	valid_0's auc: 0.740936
[80]	training's auc: 0.920765	valid_0's auc: 0.746754
Early stopping, best iteration is:
[76]	training's auc: 0.917682	valid_0's auc: 0.747482
Validation AUC score: 0.7475
Test AUC score: 0.6282
<Figure size 3240x720 with 0 Axes>
{1: {'fpr': '0.668', 'fnr': '0.315', 'tnr': '0.332', 'tpr': '0.685'},
 2: {'fpr': '0.340', 'fnr': '0.352', 'tnr': '0.660', 'tpr': '0.648'},
 'ratio 2/1': {'fp': '0.509', 'fn': '1.117', 'tn': '1.988', 'tp': '0.946'},
 3: {'fpr': '0.342', 'fnr': '0.331', 'tnr': '0.658', 'tpr': '0.669'},
 'ratio 3/2': {'fp': '1.006', 'fn': '0.940', 'tn': '0.997', 'tp': '1.032'}}

The validation score increased only slightly. We can assess this variation by analysing the ROC curve or the confusion matrix. The better performance came from the lower fnr (ratio: 0.940) and the higher tpr (ratio: 1.032). That is to say, our model got better at recognising that some positive cases were indeed positive (labels 1 predicted as 1) and that some positive cases were NOT negative (labels 1 no longer predicted as 0). It may sound obvious put like that, but it is simply the effect of discretizing the data into categories. From this information we can start asking, for instance, whether the model's weights behave differently for samples with label 1 than for samples with label 0.
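For reference, the four rates printed by `train_model` can be derived from a 2x2 confusion matrix. The counts below are hypothetical, chosen only so the resulting rates reproduce model 2's entry in the dictionary above:

```python
import numpy as np

# Minimal sketch: the four rates from a 2x2 confusion matrix laid out as
# [[tn, fp], [fn, tp]] (sklearn's confusion_matrix layout).
cm = np.array([[660, 340],
               [352, 648]])
tn, fp, fn, tp = cm.ravel()
rates = {
    "fpr": fp / (fp + tn),  # 0.340
    "fnr": fn / (fn + tp),  # 0.352
    "tnr": tn / (tn + fp),  # 0.660
    "tpr": tp / (tp + fn),  # 0.648
}
print({k: round(v, 3) for k, v in rates.items()})
```

The "ratio n/m" entries in the dictionary are simply each count of model n divided by the corresponding count of model m.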

The good model

Now, let's see what happens if we swap the columns correctly (according to the alphabetical order obtained from the "Winner" and "Loser" columns). We will also use a parameter called "noise" that preserves some columns without swapping them (keeping some noise in our model since, as you'll see, it easily overfits).
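The swap can be sketched on a toy table (the names below are real players from the dataset, but the pairings and ranks are invented): "Player1" is whichever name comes first alphabetically, so the `winner_`-prefixed column ends up holding Player1's value regardless of who actually won, removing the label information encoded in the column order.

```python
import pandas as pd
import numpy as np

# Toy match table: 'Winner'/'Loser' names plus one winner/loser stat pair.
m = pd.DataFrame({
    "Winner": ["Federer R.", "Enqvist T."],
    "Loser":  ["Knippschild J.", "Clement A."],
    "WRank":  [5, 60],
    "LRank":  [120, 8],
})

# Where the winner's name comes first alphabetically, keep the columns as-is;
# otherwise swap them, so winner_WRank always belongs to Player1.
first_is_winner = m["Winner"] < m["Loser"]
m["winner_WRank"] = np.where(first_is_winner, m["WRank"], m["LRank"])
m["loser_LRank"]  = np.where(first_is_winner, m["LRank"], m["WRank"])
print(m[["winner_WRank", "loser_LRank"]].values.tolist())  # [[5, 120], [8, 60]]
```

The notebook's `swap` function does the same thing row by row with `df.apply`; `np.where` is a vectorized equivalent that runs much faster on the full dataset.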

In [41]:
dt = df_atp.copy()
dt.head()
Out[41]:
ATP Location Tournament Date Series Court Surface Round Best of Winner ... LBL SJW SJL MaxW MaxL AvgW AvgL Labels Player1 Oponent
0 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Dosedel S. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Dosedel S. Ljubicic I.
1 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Enqvist T. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 0 Clement A. Enqvist T.
2 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Escude N. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 0 Baccanello P. Escude N.
3 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Federer R. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Federer R. Knippschild J.
4 1 Adelaide Australian Hardcourt Championships 2000-01-03 International Outdoor Hard 1st Round 3 Fromberg R. ... 51.0 19.0 81.0 76.0 42586.0 23.45 36.44 1 Fromberg R. Woodbridge T.

5 rows × 56 columns

In [42]:
dt.columns
Out[42]:
Index(['ATP', 'Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface',
       'Round', 'Best of', 'Winner', 'Loser', 'WRank', 'LRank', 'W1', 'L1',
       'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'CBW',
       'CBL', 'GBW', 'GBL', 'IWW', 'IWL', 'SBW', 'SBL', 'B365W', 'B365L',
       'B&WW', 'B&WL', 'EXW', 'EXL', 'PSW', 'PSL', 'WPts', 'LPts', 'UBW',
       'UBL', 'LBW', 'LBL', 'SJW', 'SJL', 'MaxW', 'MaxL', 'AvgW', 'AvgL',
       'Labels', 'Player1', 'Oponent'],
      dtype='object')
In [43]:
def noisy(noise,d):
    feat_w_all = ["WRank","W1","W2","W3","W4","W5","Wsets","CBW","GBW","IWW","SBW","B365W","B&WW","EXW","PSW","WPts","UBW","LBW","SJW","MaxW","AvgW"]
    feat_l_all = ["LRank","L1","L2","L3","L4","L5","Lsets","CBL","GBL","IWL","SBL","B365L","B&WL","EXL","PSL","LPts","UBL","LBL","SJL","MaxL","AvgL"]
    per = float(1 - noise)
    #display(feat_w)
    length = len(feat_w_all)
    #display((length))
    ft = int(length*per)
    #display(ft)
    if d == "l":
        feat_w = feat_w_all[:ft]
        feat_l = feat_l_all[:ft]
    if d == "r":
        feat_w = feat_w_all[-ft:]
        feat_l = feat_l_all[-ft:]
        
    print("You chose these features to swap:\n", feat_w, "\n", feat_l)
    #return feat_w_all[:ft],feat_l_all[:ft]
    return feat_w, feat_l
In [44]:
def swap(serW, serL):
    # For each winner-side column, create a "winner_" column holding the value
    # of whichever player comes first alphabetically.
    for idx, each in serW.items():
        loserItem = serL[idx]
        name = "winner_" + each
        dt[name] = dt.apply(lambda row: row[each] if row["Winner"] < row["Loser"] else row[loserItem], axis=1)

    # Same for the loser-side columns.
    for idx, each in serL.items():
        winnerItem = serW[idx]
        name = "loser_" + each
        dt[name] = dt.apply(lambda row: row[each] if row["Winner"] < row["Loser"] else row[winnerItem], axis=1)
In [45]:
###### Define the amount of noise (as a decimal) to add to our model in order not to overfit.
noise = 0.8
side = "l" ###### take the features from the left ("l") or the right ("r") of the list. WRank and LRank sit on the left and are really important features for our model.
feat_w,feat_l = noisy(noise,side)

serW = pd.Series(feat_w)
serL = pd.Series(feat_l)

swap(serW,serL)
You chose these features to swap:
 ['WRank', 'W1', 'W2', 'W3'] 
 ['LRank', 'L1', 'L2', 'L3']
In [46]:
###################################### To create dataset for What-if-Tool
############### Need first to get the df_atp with "Player1" & "Oponent" and get rid of NaN values!

# feat_w = ["WRank"]
# feat_l = ["LRank"]
# serW = pd.Series(feat_w)
# serL = pd.Series(feat_l)

# swap(serW,serL)
# display(dt.columns)
# dt.drop(['W1', 'L1','WRank','LRank',
#        'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'CBW',
#        'CBL', 'GBW', 'GBL', 'IWW', 'IWL', 'SBW', 'SBL', 'B365W', 'B365L',
#        'B&WW', 'B&WL', 'EXW', 'EXL', 'PSW', 'PSL','UBW',
#        'UBL', 'LBW', 'LBL', 'SJW', 'SJL', 'MaxW', 'MaxL', 'AvgW', 'AvgL'],axis=1,inplace=True)
# dt.drop(["Winner","Loser"],axis=1,inplace=True)
# display(dt.columns)
# dt.to_csv("data_model4_wit_swap.csv")
In [47]:
def drop_cols(df): 
    
    drop_col = [x for x in df.columns[0:9]]+["Winner",'Loser',"Player1","Oponent"] + feat_w + feat_l
    df.drop(drop_col,axis=1,inplace=True)
In [48]:
drop_cols(dt)
# dt.to_csv("data_model4_no_std.csv")
display(dt.head())
win_Rank = dt["winner_WRank"].copy()
los_Rank = dt["loser_LRank"].copy()
st_feat = dt.columns.drop("Labels")
print(st_feat)
dt = standard(dt,st_feat)

dt.head()
W4 L4 W5 L5 Wsets Lsets CBW CBL GBW GBL ... AvgL Labels winner_WRank winner_W1 winner_W2 winner_W3 loser_LRank loser_L1 loser_L2 loser_L3
0 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 11.0 ... 36.44 1 63.0 6.0 6.0 7.0 77.0 4.0 2.0 7.0
1 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 11.0 ... 36.44 0 56.0 3.0 3.0 7.0 5.0 6.0 6.0 7.0
2 7.0 7.0 70.0 68.0 2.0 1.0 14.0 25.0 7.5 11.0 ... 36.44 0 655.0 7.0 5.0 3.0 40.0 6.0 7.0 6.0
3 7.0 7.0 70.0 68.0 2.0 0.0 14.0 25.0 7.5 11.0 ... 36.44 1 65.0 6.0 6.0 7.0 87.0 1.0 4.0 7.0
4 7.0 7.0 70.0 68.0 2.0 1.0 14.0 25.0 7.5 11.0 ... 36.44 1 81.0 7.0 5.0 6.0 198.0 6.0 7.0 4.0

5 rows × 43 columns

Index(['W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'CBW', 'CBL', 'GBW', 'GBL',
       'IWW', 'IWL', 'SBW', 'SBL', 'B365W', 'B365L', 'B&WW', 'B&WL', 'EXW',
       'EXL', 'PSW', 'PSL', 'WPts', 'LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW',
       'SJL', 'MaxW', 'MaxL', 'AvgW', 'AvgL', 'winner_WRank', 'winner_W1',
       'winner_W2', 'winner_W3', 'loser_LRank', 'loser_L1', 'loser_L2',
       'loser_L3'],
      dtype='object')
Out[48]:
W4 L4 W5 L5 Wsets Lsets CBW CBL GBW GBL ... AvgL Labels winner_WRank winner_W1 winner_W2 winner_W3 loser_LRank loser_L1 loser_L2 loser_L3
0 0.212907 0.266393 0.188182 0.188122 -0.314302 -0.732394 0.706649 0.704831 0.325069 0.324405 ... 0.869636 1 -0.121657 0.578928 0.601664 0.603664 -0.009125 -0.495011 -1.501316 0.602937
1 0.212907 0.266393 0.188182 0.188122 -0.314302 -0.732394 0.706649 0.704831 0.325069 0.324405 ... 0.869636 0 -0.178328 -1.109808 -1.024420 0.603664 -0.616236 0.605299 0.628288 0.602937
2 0.212907 0.266393 0.188182 0.188122 -0.314302 1.033767 0.706649 0.704831 0.325069 0.324405 ... 0.869636 0 4.671074 1.141840 0.059636 -1.822868 -0.321113 0.605299 1.160689 0.009017
3 0.212907 0.266393 0.188182 0.188122 -0.314302 -0.732394 0.706649 0.704831 0.325069 0.324405 ... 0.869636 1 -0.105465 0.578928 0.601664 0.603664 0.075196 -2.145476 -0.436514 0.602937
4 0.212907 0.266393 0.188182 0.188122 -0.314302 1.033767 0.706649 0.704831 0.325069 0.324405 ... 0.869636 1 0.024068 1.141840 0.059636 -0.002969 1.011160 0.605299 1.160689 -1.178822

5 rows × 43 columns

In [49]:
dt.to_csv("data_model4.csv")
In [50]:
train,valid, test = get_data_splits(dt)
print("len of train : ", len(train))
over = True
train_model(train,valid,test,over,4)
Train size : 41840
Validation size : 5229
Test size : 5229
len of train :  41840
Regularization : l1 = 0.5, l2 = 0.5
Training until validation scores don't improve for 10 rounds
[10]	training's auc: 0.991813	valid_0's auc: 0.989912
[20]	training's auc: 0.9947	valid_0's auc: 0.99328
[30]	training's auc: 0.99595	valid_0's auc: 0.994723
[40]	training's auc: 0.996588	valid_0's auc: 0.995524
[50]	training's auc: 0.997527	valid_0's auc: 0.996732
[60]	training's auc: 0.998284	valid_0's auc: 0.997858
[70]	training's auc: 0.998669	valid_0's auc: 0.998404
[80]	training's auc: 0.998884	valid_0's auc: 0.998653
[90]	training's auc: 0.999041	valid_0's auc: 0.998848
[100]	training's auc: 0.999146	valid_0's auc: 0.998983
[110]	training's auc: 0.999227	valid_0's auc: 0.999091
[120]	training's auc: 0.999292	valid_0's auc: 0.999166
[130]	training's auc: 0.999356	valid_0's auc: 0.999242
[140]	training's auc: 0.999441	valid_0's auc: 0.999376
[150]	training's auc: 0.999478	valid_0's auc: 0.999412
[160]	training's auc: 0.99952	valid_0's auc: 0.999463
[170]	training's auc: 0.999553	valid_0's auc: 0.999503
[180]	training's auc: 0.999603	valid_0's auc: 0.999569
[190]	training's auc: 0.999647	valid_0's auc: 0.999635
[200]	training's auc: 0.999674	valid_0's auc: 0.999649
[210]	training's auc: 0.999694	valid_0's auc: 0.999677
[220]	training's auc: 0.999725	valid_0's auc: 0.99971
[230]	training's auc: 0.999739	valid_0's auc: 0.99972
[240]	training's auc: 0.999761	valid_0's auc: 0.999744
[250]	training's auc: 0.999771	valid_0's auc: 0.999753
[260]	training's auc: 0.999795	valid_0's auc: 0.999782
[270]	training's auc: 0.999804	valid_0's auc: 0.999786
[280]	training's auc: 0.999815	valid_0's auc: 0.999799
[290]	training's auc: 0.999829	valid_0's auc: 0.999808
Early stopping, best iteration is:
[288]	training's auc: 0.999828	valid_0's auc: 0.999808
Validation AUC score: 0.9998
Test AUC score: 0.9998
<Figure size 3240x720 with 0 Axes>
{1: {'fpr': '0.668', 'fnr': '0.315', 'tnr': '0.332', 'tpr': '0.685'},
 2: {'fpr': '0.340', 'fnr': '0.352', 'tnr': '0.660', 'tpr': '0.648'},
 'ratio 2/1': {'fp': '0.509', 'fn': '1.117', 'tn': '1.988', 'tp': '0.946'},
 3: {'fpr': '0.342', 'fnr': '0.331', 'tnr': '0.658', 'tpr': '0.669'},
 'ratio 3/2': {'fp': '1.006', 'fn': '0.940', 'tn': '0.997', 'tp': '1.032'},
 4: {'fpr': '0.007', 'fnr': '0.009', 'tnr': '0.993', 'tpr': '0.991'},
 'ratio 4/3': {'fp': '0.020', 'fn': '0.027', 'tn': '1.509', 'tp': '1.481'}}

You can see from the results that this model overfitted even after we added noise and penalized it with L1 = L2 = 0.5!

This shows how quickly LightGBM can converge to the minimum of our loss function. If you want to expose this overfitting, shuffle the data and you will see the performance drop!
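`get_data_splits` is defined earlier in the notebook and not shown in this excerpt; assuming it slices the data sequentially, a minimal stand-in that shuffles first would look like this (the function name, split fractions, and seed below are my own choices):

```python
import pandas as pd

# Hypothetical sketch: shuffle rows before splitting, so that chronologically
# adjacent (near-duplicate) matches no longer land in both train and validation.
def shuffled_splits(df, frac_valid=0.1, frac_test=0.1, seed=7):
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(df)
    n_test = int(n * frac_test)
    n_valid = int(n * frac_valid)
    test = df.iloc[:n_test]
    valid = df.iloc[n_test:n_test + n_valid]
    train = df.iloc[n_test + n_valid:]
    return train, valid, test

demo = pd.DataFrame({"x": range(100), "Labels": [i % 2 for i in range(100)]})
train, valid, test = shuffled_splits(demo)
print(len(train), len(valid), len(test))  # 80 10 10
```

With time-ordered data, shuffling mixes eras of each player's career across the splits, which is exactly the change that reveals how much the model was leaning on those repeated patterns.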

ROC CURVES FOR ALL MODELS

In [51]:
##### ALL ROC CURVES FROM 4 MODELS
Ax = plt.figure(figsize=(35,10))
colors = ['b','r','g','y']
for idx in roc:

    recall = roc[idx]['recall']
    false_positive_rate = roc[idx]['fpr']
    plt.title('Receiver Operating Characteristic (ROC)')
    plt.plot(false_positive_rate, recall, color=colors[idx-1], label = f"Model {idx}")
    
    plt.legend(loc='lower right')
    plt.plot([0,1], [0,1], 'r--')
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.0])
    plt.ylabel('Recall',fontsize=20)
    plt.xlabel('Fall-out (1-Specificity)',fontsize=20)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
plt.show()

If you want to keep only some features and play with the RandomForest Classifier, go ahead, uncomment the next cell and choose a threshold! (here it is 0.025)

In [52]:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd

# xs = {"lab":"TOEFL Score_admitVar", 1:train["TOEFL Score_admitVar"]}
# zs = {"lab":"TOEFL Score",1:train["TOEFL Score"]}
# ys = {"lab":"Chance of Admit",1:train["Chance of Admit "]}

y_pred = y_predictions[4]
fig = px.scatter_3d(valid, x=win_Rank[-5229:], y=los_Rank[-5229:], z=valid["Labels"],
              color=y_pred)
fig.show()

From the above graph we can see that, despite the almost flawless look of our model, the errors it commits have one thing in common: LRank and WRank were mostly on the same plateau (in other words, it is very unlikely to miss the prediction when the difference between WRank and LRank is large). It is also interesting to observe that the False Negatives (the blue dots among the red ones) come mostly from a narrower range of WRank than of LRank (the blue dots are concentrated between 23 and 56), whereas the False Positives (the red dots among the blue ones) are spread over a wider range of both LRank and WRank.

In [53]:
# X = dt.drop("Labels", axis=1)
# display(X.columns)
# y=dt.loc[:, "Labels"]

# feats = RandomForestClassifier(n_jobs=-1)
# feats.fit(X, y)

# plt.figure(figsize=(10, 10))
# imp = feats.feature_importances_
# cols = dt.columns.drop("Labels")

# imp, cols = zip(*sorted(zip(imp, cols)))

# plt.barh(range(len(cols)), imp, align="center", color='blue');
# plt.yticks(range(len(cols)), cols)
# plt.title("Important variables for classification")
# plt.xlabel("Relevance (%)")
# plt.tight_layout();

# import numpy as np
# th = 0.025       ################## THRESHOLD!!! Change if you want!
# imp = np.array(imp)
# most = cols[- np.where(imp > th)[0].shape[0] :]

# print(f'The most important variables influencing the result:\n {most}')
# dt = dt[[*(most + ("Labels",))]]
# display(dt.head(5))
# train,valid, test = get_data_splits(dt)
# train_model(train,valid,test,over,5)

DNN: Keras Classifier

In [54]:
import pandas
from keras.models import Sequential
from keras.layers import Dense
from tensorflow import keras
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


train,valid, test = get_data_splits(dt)

X = train.drop("Labels",axis=1) #[:,0:60].astype(float)
Y = train["Labels"]
# define model
model = Sequential()

model.add(Dense(11, activation='relu'))
model.add(Dense(60,activation='relu'))#input_dim=60
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
history = model.fit(X, Y, epochs=10, batch_size=32, verbose=1)
# evaluate the model
scores = model.evaluate(X, Y, verbose=1)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
# save model and architecture to single file
model.save("model.h5")
print("Saved model to disk")

#### Get validation scores
X_valid = valid.drop("Labels",axis=1)

from sklearn.preprocessing import MinMaxScaler

y_pred = model.predict_proba(X_valid)

print("length y_pred : ", len(y_pred))
# print("X=%s, Predicted=%s" % (Xnew.iloc[0], ynew[0]))

#### Plot Accuracy graph:
print(history.history.keys())
fig = plt.figure(figsize=(35,10))
# history.history['accuracy']
plt.plot(history.history['accuracy'], color='blue', label='train')

plt.xlabel('Epochs',fontsize=20)
plt.ylabel('auc',fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.title("AUC during training",fontsize=20)
plt.legend(fontsize=20)

plt.show()

### LOAD KERAS MODEL:
# from keras.models import load_model
 
# # load model
# model = load_model('model.h5')
Train size : 41840
Validation size : 5229
Test size : 5229
Epoch 1/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.1388 - accuracy: 0.9394
Epoch 2/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0621 - accuracy: 0.9701
Epoch 3/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0520 - accuracy: 0.9762
Epoch 4/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0448 - accuracy: 0.9815
Epoch 5/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0386 - accuracy: 0.9855
Epoch 6/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0342 - accuracy: 0.9871
Epoch 7/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0309 - accuracy: 0.9880
Epoch 8/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0287 - accuracy: 0.9888
Epoch 9/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0276 - accuracy: 0.9888
Epoch 10/10
1308/1308 [==============================] - 2s 1ms/step - loss: 0.0259 - accuracy: 0.9893
1308/1308 [==============================] - 1s 890us/step - loss: 0.0250 - accuracy: 0.9897
accuracy: 98.97%
Saved model to disk
length y_pred :  5229
dict_keys(['loss', 'accuracy'])
In [55]:
plot(None,y_pred,None,None,test,5)
{1: {'fpr': '0.668', 'fnr': '0.315', 'tnr': '0.332', 'tpr': '0.685'},
 2: {'fpr': '0.340', 'fnr': '0.352', 'tnr': '0.660', 'tpr': '0.648'},
 'ratio 2/1': {'fp': '0.509', 'fn': '1.117', 'tn': '1.988', 'tp': '0.946'},
 3: {'fpr': '0.342', 'fnr': '0.331', 'tnr': '0.658', 'tpr': '0.669'},
 'ratio 3/2': {'fp': '1.006', 'fn': '0.940', 'tn': '0.997', 'tp': '1.032'},
 4: {'fpr': '0.007', 'fnr': '0.009', 'tnr': '0.993', 'tpr': '0.991'},
 'ratio 4/3': {'fp': '0.020', 'fn': '0.027', 'tn': '1.509', 'tp': '1.481'},
 5: {'fpr': '0.011', 'fnr': '0.014', 'tnr': '0.989', 'tpr': '0.986'},
 'ratio 5/4': {'fp': '1.571', 'fn': '1.556', 'tn': '0.996', 'tp': '0.995'}}
In [56]:
#### Accuracy curve for 3 models
# # acu[n] = evals_result
# ax = plt.figure(figsize=(45,10))
#     #print('Plot metrics during training... Our metric : ', param["metric"])
#     #print("evals_ results : ", evals_result)
# for idx,each in acu.items():

#     #print(acu[idx])
#     lgb.plot_metric(acu[idx], metric='auc',figsize=(35,10))
#     plt.xlabel('Iterations',fontsize=20)
#     plt.ylabel('auc',fontsize=20)
#     plt.xticks(fontsize=20)
#     plt.yticks(fontsize=20)
#     plt.title("AUC during training",fontsize=20)
#     plt.legend(fontsize=20)
# plt.show()
    
In [57]:
# import numpy as np

# df_atp.WRank = pd.to_numeric(df_atp.WRank, errors = 'coerce') 
# df_atp.LRank = pd.to_numeric(df_atp.LRank, errors = 'coerce')
# # New Feature: Rank difference between the 2 opponents
# df_atp['Diff'] =  df_atp.LRank - df_atp.WRank 
# # New Feature: Round the rank difference to 10's and 20's
# df_atp['Round_10'] = 10*round(np.true_divide(df_atp.Diff,10))
# df_atp['Round_20'] = 20*round(np.true_divide(df_atp.Diff,20))
# # New Feature: Total number of sets in the match
# df_atp['Total Sets'] = df_atp.Wsets + df_atp.Lsets

# df_atp['Sets Diff'] = df_atp.W1+df_atp.W2+df_atp.W3+df_atp.W4+df_atp.W5 - (df_atp.L1+df_atp.L2+df_atp.L3+df_atp.L4+df_atp.L5)
# new_df = df_atp

# # 2 New Data Frames: Grand Slam data frame (GS) and non-Grand Slam data frame (non GS)
# df_non_GS = new_df[~(new_df.Series == 'Grand Slam')]
# df_GS = new_df[new_df.Series == 'Grand Slam']

# #%% Winning probability vs Rank Difference
# plt.figure(figsize = (10,10))
# bins = np.arange(10,200,10)
# Gs_prob = []
# non_Gs_prob = []

# for value in bins:
#     pos = value
#     neg = -value
    
#     pos_wins = len(df_GS[df_GS.Round_10 == pos])
#     neg_wins = len(df_GS[df_GS.Round_10 == neg])
#     Gs_prob.append(np.true_divide(pos_wins,pos_wins + neg_wins))
    
#     pos_wins = len(df_non_GS[df_non_GS.Round_10 == pos])
#     neg_wins = len(df_non_GS[df_non_GS.Round_10 == neg])
#     non_Gs_prob.append(np.true_divide(pos_wins,pos_wins + neg_wins))
    
    
# plt.bar(bins,Gs_prob,width = 9, color = 'black') 
# plt.bar(bins,non_Gs_prob,width = 8, color = 'grey')
# plt.title('Win probability vs rank difference', fontsize = 30)
# plt.xlabel('Rank difference',fontsize = 15)
# plt.ylabel('Win probability',fontsize = 15)
# plt.xlim([10,200])
# plt.ylim([0.5,0.9])
# plt.legend(['grand slams', 'Non grand slams'], loc = 1, fontsize = 15)
# plt.show()   
In [58]:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd

fig = px.scatter_3d(df4, x=df4["WRank"], y=df4["LRank"], z=df4["winner_past_vict"],
              color=df4["loser_past_vict"])
fig.show()
In [59]:
df_atp["Surface"].value_counts()
Out[59]:
Hard      27716
Clay      17044
Grass      5844
Carpet     1694
Name: Surface, dtype: int64
In [60]:
y = df_atp["Surface"].unique()
In [61]:
plt.figure(figsize=(15, 10))

#plt.subplot(131)
plt.ylabel("Surfaces")
e = plt.bar(y,df_atp["Surface"].value_counts())#color=['black', 'red', 'green', 'blue', 'cyan','orange','gray'])
#e[0].set_color('r')
#e[4].set_color("g")
plt.xticks()
#plt.ylim(0.0,0.003)
plt.tick_params(axis='x', colors='red')

plt.rc('xtick',labelsize=15)
plt.rc('ytick',labelsize=15)

plt.suptitle('Number of matches on each surface')
plt.show()

This dataset contains far more matches on the "Hard" surface. Could the surface type influence each player's performance? (a topic for a future analysis)

FIN

Animated graph showing the rank progression of Federer and Enqvist from 2000 to 2005!

In [62]:
win = df4.groupby("Winner")
fed = win.get_group('Federer R.')
esq = win.get_group("Enqvist T.")
#dos = win.get_group("Dosedel S.")
display(len(fed))
display(len(esq))
#display(len(dos))
#### FED & ESQ

re = pd.DataFrame(columns = fed.columns)
#display(re)
for idx, x in fed["Date"].isin(esq["Date"]).items():
    if x == True:
        re = re.append(fed.iloc[fed.index == idx])
re.head()

reEsq = esq.copy()
reEsq.set_index(reEsq.columns[0])
#display(reEsq)
#reEsq.index
for idx, x in esq["Date"].isin(re["Date"]).items():
    if x == False:
        reEsq.drop(idx,axis=0,inplace=True)
#reEsq.head()
#### FED & ESQ
display(len(reEsq))
display(len(re))
#### FED & ESQ
Final = pd.concat([reEsq,re])
tim = Final.groupby("Date")
#tim.first()
Final['NewDate'] = pd.to_datetime(Final['Date']) ## creating "date" timestamp from the "Date" string
Final['NewDate'] = Final['NewDate'].dt.strftime('%d-%m-%Y')

import plotly.express as px

fig = px.bar(Final, x="Winner", y="WRank", color="Winner",
  animation_frame="NewDate", animation_group="Winner", range_y=[0,200])
fig.show()
1121
137
56
65

Future work:

  1. Try other classification models (notably SVM and DNN)
  2. Carry out a deeper analysis taking into account the performance variations of the players for each: Surface, Tournament & Location
  3. Win probability according to the sets played (fatigue might affect some players more than others)
  4. Explore the EDA with graphs for each tournament and compare them for certain players